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SIMULTANEOUS INFERENCE: WHEN SHOULD HYPOTHESIS 
TESTING PROBLEMS BE COMBINED? 

By Bradley Efron 

Stanford University 

Modern statisticians are often presented with hundreds or thou- 
sands of hypothesis testing problems to evaluate at the same time, 
generated from new scientific technologies such as microarrays, med- 
ical and satellite imaging devices, or flow cytometry counters. The 
relevant statistical literature tends to begin with the tacit assump- 
tion that a single combined analysis, for instance, a False Discovery 
Rate assessment, should be applied to the entire set of problems at 
hand. This can be a dangerous assumption, as the examples in the 
paper show, leading to overly conservative or overly liberal conclu- 
sions within any particular subclass of the cases. A simple Bayesian 
theory yields a succinct description of the effects of separation or com- 
bination on false discovery rate analyses. The theory allows efficient 
testing within small subclasses, and has applications to "enrichment," 
the detection of multi-case effects. 

1. Introduction. Modern scientific devices such as microarrays routinely 
provide the statistician with thousands of hypothesis testing problems to 
consider at the same time. A variety of statistical techniques, false discovery 
rates, family- wise error rates, permutation methods etc., have been proposed 
to handle large-scale testing situations, usually under the tacit assumption 
that all available tests should be analyzed together — for instance, employ- 
ing a single false discovery analysis for all the genes in a given microarray 
experiment. 

This can be a dangerous assumption. As my examples will show, omnibus 
combination may distort individual inferences in both directions: highly sig- 
nificant cases may be hidden while insignificant ones are enhanced. This 
paper concerns the choice between combination and separation of hypothe- 
sis testing problems. A helpful methodology will be described for diagnosing 
when separation may be necessary for a subset of the testing problems, as 
well as for carrying out separation in an efficient fashion. 
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FlG. 1. Brain Data: z-values comparing 6 dyslexic children with 6 normals; horizontal 
section showing 848 of 15443 voxels. Code: Red z > 0, Green z < 0; solid circles z > 2.0, 
solid sguares z < —2.0; "x" indicates distance from back of brain; y-axis is right-left 
distance. The front half of the brain appears to have more positive z-values. Data from 
Schwartzman, Dougherty and Taylor (2005). 



Figure 1 illustrates our motivating example, taken from Schwartzman, 
Dougherty and Taylor (2005). Twelve children, six dyslexic and six normal, 
received Diffusion Tensor Imaging (DTI) brain scans, an advanced form of 
MRI technology that measures fluid diffusion in the brain, in this case at 
N = 15443 locations, each represented by its own voxel's response. A z-value 
"zj" comparing the dyslexics with the normals has been calculated for each 
voxel i, such that Zj should have a standard normal distribution under the 
null hypothesis of no dyslexic-normal difference at that brain location, 

(1.1) theoretical null hypothesis : Zi ~ iV(0, 1). 

The z-values for a horizontal section of the brain containing 848 of the 
15443 voxels are indicated in Figure 1. Distance "a;" from the back toward 
the front of the brain is indicated along the x axis. In appearance at least, 
the Zj's seem to be more positive toward the front. 

The investigators were, of course, interested in spotting locations of gen- 
uine brain differences between the dyslexic and normal children. A standard 
False Discovery Rate (FDR) analysis described in Section 2, Benjamini and 
Hochberg (1995), returned 198 "significant" voxels at threshold level q = 0.1, 
those having zi > 3.02. The histogram of all 15443 z^'s appears in the left 
panel of Figure 2. 

Separate z-value histograms for the back and front halves of the brain are 
displayed in the right panel of Figure 2, with the dividing line at x = 49.5 
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Fig. 2. Left panel: histogram of all 15433 z-values for Brain Data; the 198 voxels with 
Zi > 3.02 were judged significant by an FDR analysis with threshold level q — 0.1. Right 
panel: histograms for back and front voxels; separate FDR analyses at level q — 0.1 gave no 
significant voxels for back half, and 281 significant voxels, those with Zi > 2.69, for front 
half. MLE values are means and standard deviations for normal densities fit to the centers 
of the two histograms, as explained in Section 3. 

as shown in Figure 1. Two discrepancies strike the eye: the heavy right tail 
seen in the combined histogram on the right comes almost exclusively from 
front-half voxels; and the center of the back-half histogram is shifted leftward 
about 0.35 units compared to the front. 

Separate FDR analyses, each at threshold level q = 0.1, were run on the 
back and front half data. 281 front half voxels were found significant, those 
having Zi > 2.69; the back half analysis gave no significant voxels at all, 
in contrast to 9 significant back-half cases found in the combined analysis. 
This example illustrates both of the dangers of combination — over and under 
sensitivity within different subclasses of the experiment. 

Section 2 begins with a simple Bayesian theorem that quantifies the choice 
between separate and combined analysis. It is applied to the brain data in 
Section 3, elucidating the differences between front and back false discov- 
ery rate analyses. The theorem is most useful for separately investigating 
small subclasses, where there is too little data for the direct empirical Bayes 
techniques of Section 3. Sections 4 and 7 demonstrate how all the data, all 
N = 15833 z values in the Brain study, can be brought to bear on efficient 
FDR estimation for a small subclass, for example, just the 82 voxels at dis- 
tance x = 18. Section 6 applies small subclass theory to "enrichment," the 
assessment of a possible overall discrepancy between the z-values within and 
outside a chosen class. 
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A reasonable objection to performing separate analyses on portions of the 
N cases is the possibility of weakening control of Type 1 error, the overall 
size of the procedure. This question is taken up in Section 5, where false 
discovery rate methods are shown to be nearly immune to the danger. Some 
technical remarks and discussion end the paper in Sections 8 and 9. 

The question of separating large-scale testing problems has not received 
much recent attention. Two relevant references are Genovese, Roeder and 
Wasserman (2006), and Ferkinstad, Frigessi, Thorleifsson and Kong (2007). 
Enrichment techniques have been more actively developed, as in Subrama- 
nian et al. (2005), Newton et al. (2007) and Efron and Tibshirani (2007). 

2. A separate-class model. The "Two-Groups model," reviewed below, 
provides a simple Bayesian framework for the analysis of simultaneous hy- 
pothesis testing problems. The framework will be extended here to include 
the possibility of separate situations in different sub-classes of problems — 
for example, for the back and front halves of the brain in Figure 1 — the 
extended framework being the "Separate-Class model." 

First, we begin with a brief review of the Two-Groups model, taken from 
Efron (2005, 2007a). It starts with the Bayesian assumption that each of the 
N cases (all N = 15443 voxels for the Brain Data) is either null or nonnull, 
with prior probability po or p\ = 1 — po, and with its test statistic u z" having 
density either fo(z) or fi(z), 

Po = Probjnull}, fo( z ) density if null, 

(2.1) 

pi = Prob{ nonnull}, f\{z) density if nonnull. 
The theoretical null model (1.1) makes fo(z) a standard normal density 

(2-2) h{z) = v(z) = -^=e- l l 2 *\ 

an assumption we will question more critically later. We add the qualitative 
requirement that po is large, say, 

(2.3) po > 0.90, 

reflecting the usual purpose of large-scale testing, which is to reduce a vast 
set of possibilities to a much smaller set of interesting prospects. 

Model (2.1) is particularly helpful for motivating False Discovery Rate 
methods. Let Fq(z) and F\(z) be the cumulative distribution functions 
(c.d.f.) corresponding to fo(z) and fi(z), and define the mixture c.d.f. 



(2.4) 



F(z)= Po F (z)+p 1 F 1 (z). 
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Then the a posteriori probability of a case being in the null group of (2.1), 
given that its z-value m is less than some threshold z, is the "Bayesian false 
discovery rate" 

(2.5) Fdr(z) = Prob{null|z 4 < z} = p F (z)/ F(z). 

[It is notationally convenient here to consider the negative end of the z-scale, 
e.g., z = —3, but we could just as well take Zi > z or \zi\ > z in (2.5).] 

Benjamini and Hochberg's (1995) false discovery control rule estimates 
Fdr(z) by 

(2.6) m(z)=p F (z)/F(z) 1 
where F(z) is the empirical c.d.f. 

(2.7) F(z) = #Ui < z}/N. 

Having selected some control level "q" often q = 0.1, the rule declares all 
cases as nonnull having Zi < z max , where z max is the maximum value of z 
satisfying 

(2.8) Fd7(z max ) < q. 

[Usually taking po = 1 and Fq(z) the theoretical null c.d.f. $(z) in (2.5).] 

Rule (2.8), which looks Bayesian here, can be shown to have an important 
frequentist "control" property: if the Zi's are independent, the expected pro- 
portion of false discoveries, that is, the proportion of cases identified by (2.8) 
that are actually from the null group in (2.1), will be no greater than q. Ben- 
jamini and Yekutieli (2001) relax the independence requirement somewhat. 
Most large-scale testing situations exhibit substantial correlations among 
the z values — obvious in Figure 1 — but dependence is less of a problem for 
the empirical Bayes approach to false discovery rates we will follow here [see 
Efron (2007a, (2007b))]. 

Defining the mixture density f(z) from (2.1), 

(2-9) f(z)=Pofo(z)+p 1 f 1 (z) 

leads to the "local false discovery rate" fdr(z), 

(2.10) fdr(z) = Prob{null \z % = z} = Po f (z)/f(z) 

for the probability of a case being in the null group given z-score z. Densities 
are more natural than the tail areas of (2.5) for Bayesian analysis. Both will 
be used in what follows. 

The Separate- Class model extends (2.1) to cover the situation where the 
N cases can be divided into distinct classes, possibly having different choices 
°f P0j /o an d fi - Figure 3 illustrates the scheme: the two classes "A" and "B" 
(front and back in Figure 2) have a priori probabilities tva an d ttb = 1 — t^a- 
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Fig. 3. A Separate- Class model with two classes: The N cases separate into classes A or 
B with a priori probability tta or tvb; the Two-Groups model (2.1) holds separately, with 
possibly different parameters, within each class. 



The Two-Groups model (2.1) holds separately within each class, for example, 
with po = pao, fo(z) = /ao(z)> an d fi(z) = fAi(z) m class A. It is important 
to notice that the class label, A or B, is observed by the statistician, in 
contrast to the null/nonnull dichotomy, which must be inferred. 

Our previous definitions apply separately to classes A and B, for instance, 
following (2.9)-(2.10), 

f A (z) =PAofAo{z) +PAifAi(z) and 

(2.11) 

fdr A {z) =PAofAo{z)/fA{z). 
Combining the two classes in Figure 3 gives marginal densities 

h(z) = TTAfA0{z) + KBfBo(z), 

(2.12) 

fl{z) = 7Ta/A1 \Z) + TTBfBl(z), 

and po = ttaPAo + n B PB , so 

(2.13) f(z)=TT A fA(z)+ir B f B (z), 

leading to the combined local false discovery rate fdr(z) = Po fo{z) / f (z) as 
in (2.10). The same relationships with c.d.f.'s replacing densities apply to 
tail area Fdr's, (2.5). 

Bayes theorem yields a simple but useful relationship between the separate 
and combined false discovery rates: 



Theorem. Define tta(z) as the conditional probability of class A given 

z, 

(2.14) Tr A (z) = Prob{A\z}, 
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and also 

(2.15) TT A0 (z) = Prob {A\z}, 

the conditional probability of class A given z for a null case. Then 



(2.16) fdr^(z) =fdr(z) 



n A {z) ' 



Proof. Let I be the event that a case is null, so I occurs in the two 
null paths of Figure 3 but not otherwise. Then, from definition (2.10), 

idi A (z) _ Prob{J|A,z} _ Pmb{I,A\z} 
fdrTi) ~ ProbTTli} ~ Prob{A\z}Pmb{I\z} 

(2.17) 

_ Prob{A|J,z} _ 7taq(z) 

Prob{A|2;} tt a (z) ' □ 

The Theorem has useful practical applications. Section 4 shows that the 
ratio 

(2.18) R A (z)=n A0 (z)/Tr A (z) 

in (2.16) can often be easily estimated, yielding helpful diagnostics for pos- 
sible discrepancies between fdr A (z) and fdr(z), the separate and combined 
false discovery rates. 

Tail-area false discovery rates (2.5) also follow (2.16), after the obvious 
definitional changes, 

(2.19) Fdr A (z) = Fdr(z) ■ R A (z), 

where now R A {z) involves probabilities for cases having zi < z, 

„ . . Probn|j4|z,- < z\ 
2-20 R A (z) = 01 l ~ [ . 

rnob\A\Zi < z) 

There is no real reason, except expositional clarity, for restricting atten- 
tion to just two classes. Section 4 briefly discusses versions of the Theorem 
applicable to more nuanced situations — in terms of Figure 1, for example, 
where the relevance of other cases to the fdr at a given "x" falls off smoothly 
as we move away from x. First though, Section 3 applies the Theorem to 
the dichotomous front-back Brain Data analysis. 



3. Class-wise Fdr estimation. The Theorem of Section 2 says that sep- 
arate and combined local false discovery rates are related by 

(3.1) fdr A (z) = fdr(z) • R A (z), R A (z) = tt m (z)/tt a (z), 

where ir A o(z) and ir A (z) are the conditional probabilities Probo{^4|^} and 
Prob{A|z}. This section applies (3.1) to the Brain Data of Figures 1 and 2, 
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z value 

Fig. 4. Points are proportion of front-half voxels rAk, (3.2), for Brain Data of Figure 
2. Solid curve is tia(z), cubic logistic regression estimate of t^a{z) = Prob{A\z} ; Dashed 
curve tvao(z) estimates Probo{j4|z} as explained in text. 



taking the front-half voxels as class A. The front-back dichotomy is extended 
to a more realistic multi-class model in Section 4. 

In order to estimate tta(z), it is convenient, though not necessary, to bin 
the ^-values. Define rAk as the proportion of class A z-values in bin k, 

(3.2) r Ak = N Ak /N k , 

with N k the number of z^s observed in bin k, and iV^ the number of those 
originating from voxels in class A. The points in Figure 4 show r Ak for 
K = 42 bins, each of length 0.2, running from z = —4.2 to 4.2. As suggested 
by the right panel of Figure 2, the proportions r Ak steadily increase as we 
move from left to right. 

A standard weighted logistic regression, fitting logit^^^) as a cubic func- 
tion of midpoint Zfj~) °f bin k, with weights N k , yielded estimate tta(z) 
shown by the solid curve in Figure 4. Binning isn't necessary here, but it 
is comforting to see tta( z ) nicely following the r^k points. (The three bins 
with r^k = at the extreme left contain only N k = 2 Zj's each.) 

In order to estimate 7rAo(z), we need to make some assumptions about 
the null distributions in the Two-Class model of Figure 3. Following Efron 
(2004a, 2004b), we assume that fAo(z) and fBo(z) are normal densities, but 
not necessarily A^(0, 1) as in (1.1), say, 

(3.3) fAo(z) ~N(5 A o,(Tao) and fBo(z) ~ N(5 B o,(? 2 m ). 
Bayes theorem then gives 

■XAo(z) _ TTAo(z) 
TTbo(z) l-TTAo(z) 
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Table 1 

Parameter estimates for the null arms of the 
Two-Class model in Figure 3, brain data. Obtained 
using R program locfdr, Efron (2007b), MLE fitting 
method 















Po 


S 


(To 


7T 


A (front): 


0.97 


0.06 


1.09 


0.50 


B (back): 


1.00 


-0.29 


1.01 


0.50 



(3.4) 

-5 A o\ 2 fz-5 B0 \ 2 



^APAO^BO 

■ exp< — 0.5 



O~A0 J \ &B0 



ttbPbovao 

which is easily solved for ttao(z). 

The R algorithm locfdr used "MLE fitting" described in Section 4 of 
Efron (2007b) to provide the parameter estimates in Table 1. (The front- 
back dividing line in Figure 1 was chosen to put about half the voxels into 
each class, so tta/ttb = 1-) Solving for ttao hi (3.4) gave the dashed curve of 
Figure 4. 

Looking at Figure 2, we might expect fdi a(z), the local fdr for the front, 
to be much lower than the combined fdr(z) for large values of z, that is, to 
provide many more "significant" z-values, but this is not the case: in fact, 

(3.5) R A (z) = 7T A o(z)/7rA(z) = 0.94 

for z > 3.0, so formula (3.1) implies only small differences. Two contradictory 
effects are at work: the longer right tail of the front-half distribution by itself 
would produce small values of Ra(z) and fdr^(z); however, this effect is 
mostly canceled by the rightward shift of the whole front-half distribution, 
which substantially increases the numerator of fdr^z) = PAofAo(z)/ fA(z), 
(2.11). [Note: the "significant" voxels in Figure 2 were obtained using the 
theoretical null (1.1) for both the separate and combined analyses, making 
them somewhat different than those based on the empirical null estimates 
here.] 

The close match between ttao(z) and tta(z) near z = is no accident. 
Following through the definitions in Figure 3 and (2.11) gives, after a little 
rearrangement , 

^ 36 ^ n A (z) _ TTAo(z) 1 + Qa(z) 



l-ir A {z) 1-ttao(z)1 + Qb{z) 
where 

(3 - 7) QA{Z) = M.A(Z) ^ Qb[Z)= fdTB(z) 
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Fig. 5. Brain Data: z-values plotted vertically versus distance x from back of brain. Small 
histogram shows x values for the 198 voxels with z% > 3.02, left panel of Figure 2. Running 
percentile curves reveal general upward shift of z-values near histogram mode at a; = 65. 



Usually fdr^(z) and fdr#(z) will approximately equal 1.0 near z = 0, reflect- 
ing the large preponderance of null cases, (2.3), and the fact that nonnull 
cases tend to produce z-values further away from 0. Then (3.6) gives 

(3.8) tta(z) = ttao(z) for z near 0, 

as seen in Figure 4. 

Suppose we believe that fAo(z) = fBo(z) in Figure 3, in other words, that 
the null cases are distributed identically in the two classes. [This being true, 
e.g., if we accept the theoretical null distribution (1.1), as is usual in the 
microarray literature.] Then ttao(z) will be constant as a function of z, 

KAPAQ kaPao 



(3.9) ttao(z) 



KAPA0 + KBPBO P0 



Since tta(z) m Figure 4 is not constant near z = 0, and should closely ap- 
proximate ttao(z) there, we have evidence against /ao(z) = Ibq{z) in this 
case, obtained without recourse to models such as (3.3). 



4. Fdr estimation for small subclasses. Our division of the Brain Data 
into front and back halves was somewhat arbitrary. Figure 5 shows the 
N = 15443 z-values plotted versus x, the distance from the back of the 
brain. A clear wave is visible, cresting near x = 65. Most of the 198 BH(0.1) 
significant voxels of Figure 2 occurred at the top of the crest. 

There is something worrisome here: the z-values near x = 65 are shifted 
upward across their entire range, not just in the upper percentiles. This 
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Fig. 6. Left panel: Solid histogram shows 82 z-values for class A, the voxels at x = 18; 
Line histogram for all other z-values in Brain Data; "cub" is cubic logistic regression 
estimate of 1/ Ra(z) . Right panel: combined local false discovery rate fdr(z), obtain from 
locfdr algorithm, and fdrA(z) = fdr(z) ■ Ra(z). Dashes indicate the 82 ZiS. 



might be due to a reading bias in the DTI imaging device, or a genuine dif- 
ference between dyslexic and normal children for all brain locations around 
x = 65. In neither case is it correct to assign some special significance to those 
voxels near x = 65 having large Zi scores. In fact, a separate Fdr analysis of 
the 1956 voxels having x in the range [60, 69] [using locfdr to estimate their 
empirical null distribution as iV(0.65, 1.44 2 )] yielded no significant cases. 

Figure 5 suggests that we might wish to perform separate analyses on 
many small subclasses of the data. Large classes can be investigated directly, 
as above, but a fully separate analysis may be too much to ask for a subclass 
of less than a few hundred cases, as shown by the accuracy calculations of 
Efron (2007b). Relationship (3.1) can be useful in such situations. 

As an example, let class A be the 82 voxels located at x = 18. These 
display some large z-values in Figure 5, attained without the benefit of 
riding a wave crest. Figure 6 shows their z-value histogram and a cubic 
logistic regression estimate of tta.(z)/tva, where 

(4.1) 7ta = 82/15543 = 0.0053. 

This amounts to an estimate of 1/Ra(z) in (3.1). Here we are assuming that 
A shares a common null distribution with all the other cases, /ao(z) = /bo(z) 
as in (3.9), a necessary assumption since there isn't enough data in A to 
separately estimate ttao(z). In its favor is the flatness of 7ta(z)/tta near z = 0, 
a necessary diagnostic signal as discussed at the end of Section 3. [Remark 



12 



B. EFRON 



G of Section 8 discusses replacing tta with ttao, (3-9), in the estimation of 
Ra(z).} 

The right panel of Figure 6 compares the combined local false discovery 
rate fdr(z), estimated using locfdr as in Efron (2007a), with 

(4.2) £v A (z) = Mi(z)-R A (z), 

for z > 0. The adjustment is substantial. Whether or not it is genuine de- 
pends on the accuracy of Ra(z), as considered further in the "efficiency" 
discussion of Section 7. 

So far we have only discussed the dichotomization of cases into two classes 
A and B of possibly separate relevance. Figure 5 might suggest a more con- 
tinuous approach in which the relevance of case j to case i falls off smoothly 
as \xj — Xi\ increases, for example, as 1/(1 + \xj — Xj|/10). We suppose that 
each case comprises three components, 

(4.3) casei = (xi,Ii,Zi), 

where Xi is an observable vector of covariates, Ii is the unobservable null 
indicator having Ii = 1 or as case^ is from the null or nonnull group in 
(2.1), and Zi is the observable z-value. We also define a "relevance function" 
p%{x) taking values in the interval [0,1], which says how relevant a case with 
covariate x is to case^, the case of interest. [Previously, Pi{xj) = 1 or as Xj 
was or was not in the same class as Xj.] 

The Two-Class model of Figure 3 can be extended to a multi-class model, 
where each covariate value x has its own distribution parameters p x o, fxo(z), 
and f x i(z), giving 

fx(z) =p x0 fxo(z) + (l-Pxo)fxi(z) and 

(4.4) 

fdv x (z) =Pxofxo(z)/f x {z) 

as in (2.11). Let fdr(z) be the combined local false discovery rate as in the 
Theorem of Section 2, and fdrj(z) the separate rate fdr Xi (z). Then (3.1) 
generalizes to 

(4.5) fdr,(z) = fdr(z) • R^z), R^z) = ^f}"] , 

E{ Pi {x)\z} 

"£^o" indicating null case conditional expectation. 

Tail area false discovery rates (2.5) also satisfy (4.5) after the requisite 
definitional changes, 

(4.6) Fdr,(z) = Pdr(z) • R i (z), R t (z) = ^^7^ 

E{pi(X)\Z < z) 

The empirical version of (4.6) clarifies its meaning. Let pjo and Fjo(z) 
indicate p x o and the c.d.f. of f x o(z) for x = xj, (4.4), and let N(z) = #{zj < 
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z}. Taking account of all the different situations (4.3), the combined Fdr 
estimate (2.5) becomes 

N 

(4.7) m(z) = '£ Pj0 Fj (z)/N(z), 

3=1 

this being the ratio of expected null cases to observed total cases for Zj < z.- L . 
Similarly, 

N 

(4.8) Fdii(z) = ^2p i (x j ) Pj0 F j o(z) / J2 Pi( x j)i 

the ratio of expected null cases to total cases taking account of the relevance 
of Xj to X\. Therefore, 

m l {z)=miz)-R i {z), 

(4.9) 

AZ) [Y. Zj < z Pi{x ] )/N{z)\ 

the denominator of Ri(z) is an obvious estimate of E{pi{X)\Z < z} in (4.6), 
while the numerator is the Bayes estimate of Eo{pi(X)\Z < z}. 

5. Are separate analyses legitimate? The principal, and sometimes only, 
concern of classical multiple inference theory was the control of Type I error 
in simultaneous hypothesis testing situations. This raises an important ques- 
tion: is it legitimate from an error control viewpoint to split such a situation 
into separate analysis classes? The answer, discussed briefly here, depends 
upon the method of inference. 

The Bonferroni method applied to N independent hypothesis tests rejects 
the null for those cases having p-value pi sufficiently small to account for 
multiplicity, 

(5.1) Pi<a/N, 

a = 0.05 being the familiar choice. If we now separate the cases into two 
classes of size N/2, rejecting for pi < a/ (N/2), we effectively double a. 
Some adjustment of Bonferroni's method is necessary if we are contemplat- 
ing separate analyses. Changing a to a/2 works here, but things get more 
complicated for situations such as those suggested by Figure 5. 

False discovery rate methods are more forgiving — usually they can be 
applied to separate analyses in unchanged form without undermining their 
inferential value. Basically, this is because they are rates, and, as such, cor- 
rectly scale with "iV." We will discuss both Bayesian and frequentist justi- 
fications for this statement. 
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Starting in a Bayesian framework, as in (4.3), let 

(5.2) (X,I,Z) 

represent a random case, where X is an observed covariate vector, / is an 
unobserved indicator equaling 1 or as the case is null or nonnull, and Z 
is the observed z- value. We assume X has prior distribution ir(x). X might 
indicate class A or B in Figure 3, or the distance from the back of the brain 
in Figure 5. 

Let Fdr x (z) be the Bayesian tail area false discovery rate (2.5) conditional 
on observing X = x, 

(5.3) Fdv x (z) = Prob{/ = l\X = x, Z< z). 

(Remembering that we could just as well change the Z condition to Z > z 
or \Z\ < z.) For each x, define a threshold value z(x) by 

(5.4) Fdx x {z{x))=q 

for some preselected control level q, perhaps q = 0.10. This implies a rule 1Z 
that makes decisions "I" according to (5.4), 

1 (null), if Z>z(X), 

(nonnull), if Z < z. 



(5.5) /: 



Rule 1Z has a conditional Bayesian false discovery rate q for every choice 
of x, 

(5.6) Fdr x . (ft) = Prob{J = 1\T= 0, X = x} = q. 
Unconditionally, the rate is also q: 

Fdr (ft) = Prob{7 = 1|J = 0} 

(5.7) = / Prob{/ = 1\Z <z(X),X = x}tt{x\Z <z)dx 

Jx 

= I q ■ ir(x\Z < z) dx = q. 
Jx 

This verifies the Bayesian separation property for false discovery rates: 
Fdx x (JZ) = q separately for all x implies Fdr(7£) = q. Separating the Fdr 
analyses has not weakened the Fdr interpretation for the entire ensemble. 

For any fixed value of z, the combined Bayesian false discovery rate Fdr(,z) 
(2.5) is an a posteriori mixture of the separate Fdr x (z) values, 

(5.8) Fdv(z)= [ Fdr x (z)vr(x|Z <z)dx, 

Jx 

by the same argument as in the top line of (5.7). There will be some threshold 
value "z(comb)" that makes 

(5.9) Fdr(z(comb)) =q, 
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defining a combined decision rule 1Z com \> that, like 1Z, controls the Bayes 
false discovery rate at q. Because of (5.8), z(comb) will lie within the range 
of z(X); 7£ C omb will be more or less conservative than the separated rule 
TZ as ,z(comb) < z(X) or z(comb) > z(X). 1 Not using the information in X 
reduces the theoretical accuracy of rule 7£ com t,; see Remark A of Section 8. 

Result (5.7) justifies Fdr separation from a Bayesian point of view. The 
corresponding frequentist /empirical Bayes calculations lead to essentially 
the same conclusion, though not as cleanly as in (5.7). 

In order to justify the empirical Bayes interpretation of Fdr(z) (2.6), we 
would like it to accurately estimate Fdr(z) (2.5). Following model (2.1), 
define 

(5.10) N(z) = #{zi < z} and e(z) = E{N(z)} = N • F(z); 
also let 

(5.11) D =™44 and i- 1 -^- 1 



Fdr(z) e(z) N F(z) 

Assuming independence of the z% (just for this calculation) standard bi- 
nomial results show that 

(5.12) E{D} = l + d and var{D} = d, 

where the approximations are accurate to order 0(1/A), ignoring terms 
0(1/N 2 ). Remark B improves (5.12) to accuracy 0(1/A 2 ). 

We see that Fdr(z) is nearly unbiased for Fdr(z) with coefficient of vari- 
ation 

(5.13) CX(Fdr{z)) = d 1 / 2 <e{z)- 1 ' 2 . 

We need e(z) to be reasonably large to make CV small enough for accurate 
estimation, perhaps 

(5.14) e(z) = N ■ F(z) > 10 for CV < 0.3. 

If we are working near the 1% tail of F(z), common enough in Fdr applica- 
tions, we need > 1000. 

Making "A" as large as possible in the best reason for combined rather 
than separate analysis, at least in an empirical Bayes framework. The sepa- 
rate analyses still have large A's in the front-back example of Figure 2, but 



1 Genovese, Roeder and Wasserman (2006) consider more qualitative situations where 
what we have called "A" or "B" might be classes of greater or less a priori null probability. 
Their "weighted BH" rule transforms z into values za or zb depending upon the class, 
and then carries out rule (2.5) on the transformed z's. Here, instead, the z's are kept 
the same, but compared to different thresholds z (x). Ferkinstad et al. (2007) explore the 
dependence of Fdr(z) on x via explicit parametric models. 
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not in Figure 6. Section 7 shows how the small subclass approach used in 
Section 4 can improve estimation efficiency. 

The danger of combination is that we may be getting a good estimate of 
the wrong quantity: if Fdr a(z) is much different than Fdr B (z), Fdr(z) may 
be a poor estimate of both. 

Returning to a combined analysis, let Nq(z) and N±(z) be the number of 
null and nonnull Zj's equal or less than z, for some fixed value of z, 

(5.15) N (z) = #{z i <z,I l = l} and N x {z) = #{z i <z,I i = ti\ 

in notation (4.3), so Nq(z) + N\(z) = N(z), (5.10); also let eo(z) and e\{z) 
be their expectations, 

e (z) = E{N (z) } = N Po • F (z) and 

(5.16) 

e 1 (z)=E{N 1 (z)} = N Pl -F 1 (z). 

The rule that declares Jj = if z% < z (i.e., "rejects the null" for z% < z) 
has actual /a/se discovery proportion 

(5.17) Fdp(z) - No{2) 



N (z) + N 1 (z) 



Fdp (z) is unobservable, but we can estimate it by Fdr(z) (2.2), equaling 
eo(z) I (Nq(z) + Ni(z)) in notation (5.15). This is conservative in the fre- 
quentest sense of being an upwardly biased estimate. In fact, it is upwardly 
biased given any fixed value of N\(z): 

eo(z) 



(5.18) > 



> E 



e (z)+N 1 (z) 
N (z) 



N (z) + N 1 (z 
E{Fdp(z) | iVi(z)} 



Ni(z) 



where Jensen's inequality has been used twice. Only the definition E{Nq(z) = 
eo(z) is required here, not independent z^s. 

Now suppose we have separated the cases into classes A and B, employ- 
ing separate rejection rules zi < za and Z{ < zb satisfying (in the obvious 
notation) 

(5.19) E{m A (z A )} = E{m B (z B )}=q. 

Applying (5.18) shows that the separate false discovery proportions will 
be controlled in expectation at rate q. However, for the equivalent of the 
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Bayesian result (5.7) to hold frequentistically, we want the combined False 
Discovery Proportion, 

9m pj = Na(za) + N B0 (zb) 

1 j Pc ° mb N A0 (z A ) + N B0 (z B ) + N A1 (z A ) + N m (z B ) ' 

to satisfy -E{Fdp com b} < q. Remarks C and D show that asymptotically 
(5.21) £{ Fdpcomb } = (? __^ + o(l/iV 2 ) 

for some c > 0, but an exact finite-sample result has not been verified. (It 
fails if the denominator expectations are very small.) 

Simulations show the original Benjamini-Hochberg rule behaving in the 
same way; applying rule (2.8) separately to classes A and B also controls 
the overall expected value of Fdp comb at rate q, in the sense of (5.20). But 
again this has not been verified analytically. 

The conclusion of this section is that separate false discovery rates anal- 
yses are legimate, in the sense that they do not inflate the combined Fdr 
control rate, at least not if the denominator expectations are reasonably 
large. 

6. Enrichment calculations. Microarray studies frequently yield disap- 
pointing results because of low power for detecting individually significant 
genes, Efron (2007a). "Enrichment" techniques strive for increased power 
by pooling the z-values from some pre-identified collection of genes, for in- 
stance, those from a specified pathway, as in Subramanian et al. (2005), 
Newton et al. (2007) and Efron and Tibshirani (2007). By thinking of the 
pooled collection as "class A" in (2.16), the Theorem of Section 2 can be 
brought to bear on enrichment analysis. 

Figure 7 involves a microarray study of 10,100 genes, featured in Subrama- 
nian et al. (2005), concerning transcription factor, p53. The study compared 
17 normal cell lines with 33 lines exhibiting p53 mutations. Two-sample t- 
tests yielded z-values Zi for each gene, but the results were disappointing: a 
standard Fdr test (2.8), with q = 0.1, yielded only one nonnull gene, "BAX." 

The solid histogram in the left panel of Figure 7 shows values for the 40 
genes in set "P53_UP," a collection of genes known to be up-regulated by 
gene p53. Compared with the line histogram of the 10,060 other z^s, the 
P53_UP set definitely looks "enriched," even though it contains only one 
individually significant z-value. 

The same analysis as in Figure 6 was applied in Figure 7, with class A 
now the 40 P53_UP genes. The right panel shows Fdr(z) for the combined 
analysis of all 10,100 genes [obtained from locfdr, using the theoretical null 
(1.1)], and 

^ ^ , , ^ , , n A 
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with tta = 40/10, 100 as in (4.1) and tta(z) obtained from a cubic logistic 
regression. We see that fdr a(z) is much smaller than fdr(z) for z > 2. Six of 
the P53_UP genes have fdr^(zj) < 0.10, now indicating strong evidence of 
being nonnull. 

The null hypothesis of no enrichment can be started as fdr^(^) = fdr(z). 
Assuming ttao(z) constant, as in (3.9), Theorem (2.16) provides the equiv- 
alent statement 

(6.2) enrichment null hypothesis : tta(z) = constant, 

so we can use tta(z) to test for enrichment. For instance, we might estimate 
7ta(z) with a linear logistic regression, and use the test statistic 

(6.3) S = (3/se, 

where is the estimated regression slope and 3e its estimated standard 
error. 

For the P53_UP gene set, (6.3) gave 5 = 4.54, two-sided p-value 6.10~ 6 . 
This agrees with the analyses in Subramanian et al. and Efron and Tib- 
shirani, both of which judged P53_UP "enriched," even taking account of 
simultaneous testing for several hundred other gene sets. (In situations like 
that of Figure 6, where the class A z^s extend across a wide range of neg- 
ative and positive values, S should be calculated separately for z < and 
z > 0; in Figure 6 the positive Zj's yielded S = 3.23,p-value 0.001.) 




-* -i 2 4 -2-1 T 2 3 A 

z values zyeiIw 

Fig. 7. p53 microarray study, 10,000 genes, comparing normal versus mutated cell lines. 
Solid histogram in left panel shows z -values for 40 genes in class P53JJP, compared with 
all others (line histogram). Right panel compares fdr(z) for all 10,000 genes with fdrA(z) 
obtained as in (4-12). Six of the P53JJP genes have fdryi(z) < 0.1. 



SEPARATING AND COMBINING HYPOTHESIS TESTS 



19 



Remark E connects (6.3) to more familiar enrichment test statistics, and 
suggests that it is likely to be reasonably efficient. The approach here has 
one notable advantage: we obtain an assessment of individual significance 
for the genes within the set, via fdr^Zj), rather than just an overall decision 
of enrichment. 

7. Efficiency. We used the Theorem of Section 2 to estimate fdr^(z) for 
subclass A: 

(7.1) fdr A (z)=fdr(z)^-, 

[setting ttao(z) in (2.6) equal to tt a in Figures 6 and 7]. Of course, we could 
also estimate, fdr A (z) or the tail area analogue Fdr^(z), (2.19), directly 
from the class A data alone, but (7.1) is substantially more efficient. This 
section gives a very brief overview of the efficiency calculation, with Remark 
E of Section 8 providing a little more detail, concluding with a simulation 
example supporting the accuracy of our methodology. 
Taking logarithms in (7.1) gives 

(7.2) logfdr A (z) = logfdr(z) + logR A (z), [R A {z) = 7r A /iv A (z)]. 

It turns out that log fdv(z) and log R A (z) are nearly uncorrelated with each 
other, leading to a convenient approximation for the standard deviation of 
log fdi A (z): 

(7.3) sd{logidr A (z)} = [( S 4logfdr(z)}) 2 + (sd{logR A (z)}) 2 ] 1 / 2 . 

Section 5 of Efron (2007b) provides an accurate delta method formula for 
sd{logfdr(z)}; sd{log R A (z)} is also easy to approximate, using familiar lo- 
gistic regression^ calculations. 

We expect idr A (z) to be more variable than fdr(z) since class A involves 
only proportion n A of all N cases; standard sample-size considerations imply 

(7.4) sd{logfdr A (z)} ~ -— sd{logfdr(z)} 

if idr A (z) and fdr(z) were estimated directly. Estimation method (7.1) does 
better — the extra variability added to fdr(z) by R A {z) = tt a /tt a (z), repre- 
sented by the last term in (7.3), tends to be much smaller than (7.4) suggests. 

Figure 8 illustrates a simulation example. In terms of the Two-Class model 
of Figure 3, the parameters are 

iV = 5000, VTA = 0.01, PA0 = 0.5, Pm = 1.0 

(7.5) 

}aq = /so ~ N(0, 1) and f A1 = ~ JV(2.5, 1); 
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Fig. 8. The three standard deviation terms of (7.3) for simulation model (7.5). 

so we expect 50 of the 5000 z^s to be from class A, and 25 of these to be 
nonnulls distributed as N(2.5, 1), the remaining 4975 cases being iV(0, 1) 
nulls. 

At z = 2.5, the center of the nonnull distribution, the ratio 



is 1.61, compared to the ratio 10 suggested by (7.4). Similar results were 
obtained for other choices of the simulation parameters, (7.5). See Remark H. 

8. Remarks. The remarks of this section expand on some of the technical 
points raised earlier. 

Remark A (Information loss if X is ignored). The observed covariate 
X in (5.2), or x\ in (4.3), is an ancillary statistic that affects the posterior 
probability of a null case, fdr x (z) = Prob{/ = \\X = x,Z = z}. General prin- 
ciples say that ignoring X will increase prediction error for I, and this can 
be made precise by considering specific loss functions. 

Suppose we wish to predict a binary variate I that equals 1 or with 
probability p or 1—p; for a prediction "P" in (0, 1), let the loss function be 



where q(-) is a positive concave function on (0, 1) satisfying q(0) = q(l) = 
[e.g-, q(p) =P- or q(p) = ~[plog(p) + (1 -p) log(l -p)}], and q(p) = 



sd{logfdrA(^)}/s(i{logfdr(2;)} 



(8.1) 



Q(I,P)=q(P)+q(P)(I-P) 
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dq/dp. This is the "Q class" of loss functions discussed in Efron (2004a). It 
then turns out that choosing P equal to the true probability p minimizes 
expected loss, with risk E{Q(I,p)} = q(p). 

Given Z = z, the marginal false discovery rate is 

(8.2) fdr(z) = Prob{/ = 1 | Z = z] = [ fdrJzWdz) dx, 

Jx 

similar to (5.8). Then, with fdr (2) or fdr x (z) playing the role of the true 
probability P, we have 

(8.3) ?(fdr(z)) =q(j f&r x (z)ir(x\z) dac\ > J q(f&r x (z))ir(x\z) dx 

by Jensen's inequality — in other words, the unconditional marginal risk 
q(f&v(z)) exceeds the expected risk conditioning on x. 

Remark B [Coefficient of variation of Fdr(z)]. Standard calculations 
involving the first three moments of a binomial variate yield the mean and 
variance of D = Fdr (z)/ Fdr (z), 

(8.4) D ~ (l + d-d 2 c, d-d 2 {6c- 1)) 

for d={\- F{z))/{N ■ F{z)) as in (5.11), and c = (1 - 2F{z))/{\ - F{z)), 
with errors 0(l/iV 3 ). This gives approximate coefficient of variation 

(8.5) CV(¥dr) = d x / 2 [l - d(3c - 1/2)], 
improving on (5.13). 

Remark C (Poisson model for Fdr relationship). Let Z indicate some 
region of interest in the space of z-values for Figure 3, for instance z < za 
in the A branch and z < zb in the B branch. Denote the number of null, 
nonnull, and total Zi values in Z as Nq(Z), N\(Z) and N{Z) = Nq(Z) + 
Ni(Z), with corresponding expectations eo(Z), e\{Z) and e(Z) = e${Z) + 
ei(Z). Section 5 considers the relationship of three quantities, 

(8,) = ^ = ^ Fd P ( 2 )^, 

the estimated Benjamini-Hochberg FDR, the Bayesian Fdr and the False 
Discovery Proportion. 

Now assume that iV in Figure 3 is Poisson with expectation //, and that 
the Zi's are independent, 

(8.7) A r ~Poi(/x), z\, Z2, zn independent, 
implying that Nq(Z) and N\(Z) are independent Poisson variates, 

(8.8) N {Z) ~ Poi(e (Z)) independent of N X (Z) ~ Poi(ei(Z)). 
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We can write 

(8.9) N (Z) = e (Z)+5 , 
where 5q has first three moments 

(8.10) 6 ~(0,e (Z),e (Z)), 

and similarly for N\(Z) = e 1 {Z) + b~\ and N(Z) = e(Z) + 5. 

Note. The independence in (8.7) is not a necessary assumption, but it 
leads to the neatly specific forms of the relationship below. 

The Poisson assumptions make it easy to relate the two random quantities 
Fdr(iJ) and Fdp(iJ) in (8.6) to the parameter Ydi{Z): 

Lemma. Under assumption (8.7), 

(8.11) £{Fdr"(Z)} = Fdr(Z) • (1 + l/e{Z)) + 0(l/e(Z) 2 ) 
and 

(8.12) E{Fdp(Z)} = Fdv(Z) + 0{l/e{Z) 2 ). 

Typically, e(Z) = e${Z) + e\{Z) will be O(fi), so that the error terms in 
(8.11)-(8.12) are 0(1//U 2 ), effectively 0{l/N 2 ). The Lemma shows that 
Fdr(Z), the Bayesian false discovery rate, is an excellent approximation 
to E{Fdp(Z)} , while E{Fdr(Z)} is only slightly upwardly biased. 



Proof. Following through definitions (8.6) and (8.9), 

e (Z) e {Z) 



Fdr(Z) - Fdr(Z) 



e(Z) + 5 e(Z) 



1.13) 



Fdr(Z) 



1 



= Fdr{Z) 



l + 5/e{Z) 

5 5 2 
+ 



e(Z) e{Z)\ 
so taking expectations yields (8.11). Similarly, 

l + 5 /e (Zy 



3.14) 



Fdr(Z) - Fdp{Z) = Fdi(Z) 



Fdr(Z) 



1 



l + 5/e(Z 
S 



1 + 



eo(Z) 



1 - 5/e(Z) + 



e{Zf 
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Therefore, 

E{Fdi{Z) - Fdp(Z)} 

(8.15) 



Fdr(Z) 
Fdr(Z) 



E 



s s 



e (Z)e(Z) 
eo(Z) e(Z) 



E 



e (Z)e(Z) e(Zy 



e (zy 

= 0. 



□ 



Remark D (Frequentist Fdr combination result). The Lemma above 
leads to a heuristic verification of (5.21): that under (5.19), E{FdT com ) 3 } < q 
in (5.20). To begin with, notice that (8.7) gives expected values of Nao(za) 
and N A (z A ), 

(8.16) e A0 (z A ) = fj,TT A p A0 F A0 (z A ) and e A (z A ) = fiir A F A (z A ), 
e A o(z A ) _ PaoFao(za) 



so 



U7) 



e A (z A ) F A (z A ) 



Fdr A (z A ), 



the Bayesian Fdr for class A, and similarly, e B o(z B )/e B (z B ) = Fdr B {z B ). 
From (8.11) and (5.19), applied individually within the two classes, 



3.18) 



Fdv a (za) = q 



Fdr b (zb) = q 



1 



1 



1 



e A {z A ) 
1 



< q and 



< 



e B (z B )_ 

Therefore, the combined Bayesian Fdr is also bounded by q, 

e A o(z A ) +e B o(z B ) 



Fdr com b 



3.19) 



e A {z A ) + e B (z B ) 

_ e A {z A )Fdi A {z A ) + e B (z B )Fdi B (z B ) 
e A (z A ) + e B (z B ) 

< e A {z A )q + e B (z B )q _ 



e A (z A ) + e B {z B ) 
But S{Fdp com b} = Fdr com b according to (8.12), verifying (5.21). 

Remark E {The slope statistic for testing enrichment). Slope statistic 
(6.3), S = f3/s~e, is asymptotically fully efficient for enrichment testing un- 
der a two-sample exponential families model. Suppose that all the z-values 
come, independently, from a one-parameter exponential family having den- 
sity functions 

^-^g (z), 



3.20) 
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as in Lehmann and Romano (2005), with 

(8 21) r?=<f ?M ' inclassA > 

\ rjBi in class B. 

Define (3 = T]a — Vb- As N — > oo, the MLE (3 has asymptotic null hypoth- 
esis distribution 

(8.22) $~n(/3, 



N-ka^bV^o) 



where tta and ttb are the proportions of ZiS from the two classes, and V(t]q) 
is the variance of z if t]a = t]B equals, say, 770- 
Bayes rule applied to (8.20)-(8.21) gives 



3.23) logit(7TA(^)) = (3z + c 



with (3 = rjA — riB as above. Standard calculations show that (3 obtained from 
the logistic regression model (8.23) also satisfies (8.22). This implies full 
asymptotic efficiency of the slope statistic (6.3) for testing r\A = f]B-, the "no 
enrichment" null hypothesis. Under normal assumptions, g^{x) ~ N(rj, 1) 
in (8.20), (6.3) is asymptotically equivalent to ~za — ~zb, the difference of 
class means; "limma" [Smyth (2004)], an enrichment test implemented in 
Bioconductor, is also based on za, as discussed in Efron and Tibshirani 
(2007b). 

Remark F [The additive variance approximation (7.3)]. Being a little 
more careful, we can use (3.9) to write (7.1) as 

(8.24) fdr A (z) = fdr(z)^fL L M = 5^9 ) , 

nA{Z) \ TTAPAO + KBPBOJ 

under the assumption that fAo(z) = fBo(z) in Figure 3. Binning the data as 
in (3.2) gives 

(8.25) £fdr Ak = fflr k - B Ak + log^o), 

where lldiAk is log(fdrA(z)) evaluated at the midpoint Za.\ of bin k, and sim- 
ilarly, ifdvk = log(fdr(z( fc ))) and IftAk = log (tta (;?(*;)))• Locfdr computes the 
estimates ifdik from the vector of counts N = (. . . , Nk, . . .), while a standard 
logistic regression program computes l^Ak from the vector of proportions 
va = (■ ■ ■ , rAk = ^Ak/Nk, ■ ■ •), (3.2); N and are, to a first order of calcu- 
lation, uncorrelated, leading to approximation (7.3). 

In broad outline, the argument depends on the general equality 

(8.26) v&r{X + Y} = var{X} + var{y} + 2cov{X, E(Y\X)}, 
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applied to X = iidik and Y = —litAk- Both var{X} and var{y} are 0(1/N), 
but, because the expectation of does not depend on N, the covariance 
term in (8.26) is of order only 0(1/N 2 ), and can be ignored in (7.3). 

Remark G (The assumption of identical null distributions). If we are 
willing to assume that /ao( z ) = fBo( z ) in Figure 3 [or equivalently, that 
Iao(z) = fo{z), the combined null density], then relationship (3.1) becomes 

(8.27) fdr A (z) = fdr(z) ^ [t^ao = ^APAo/Po], 

KA{Z) 

as in (3.9), which can be written as 

VTA PAO 



3.28) fdr A (z) =fdr(z) 



ka(z) po 



The examples in Figures 6 and 7 estimated R a {z) = t^ao/^a(z) by i^a/^a{z), 
ignoring the final factor pao/po m (8.28). This is probably conservative: one 
would expect a small subclass A of interest to have proportionately more 
nonnull cases than the whole ensemble, in other words, to have pao/po < 1- 

It isn't difficult to estimate the full relationship (8.27). Since ttao = tta(z) 
for z near 0, (3.8)-(3.9), we can set 

(8.29) £ Ta{z) = Mv(z)^\; 

(8.29) gives better results if the logistic regression model for estimating tta(z) 
incorporates a flat interval around z(0) — for instance, if only positive values 
of z are of interest, 

(8.30) ]D&t(ir A (z)) =Pl + /?2max(z - 1,0) 2 + /9 3 max(z - 1,0) 3 . 

Remark H (Simulation example). Figure 9 graphs 100 simulations of 
fdr^-z), (7.1), drawn from model (7.5). The comparison with the actual 
curve fdr^z) shows excellent accuracy. Here neither the simulations nor the 
actual curve incorporate the factor pao/po m (8-28), which could be included 
as in (8.29)-(8.30). 

A simpler correction starts with fdr a(z), (7.1), estimates pao by 

(8.31) PAa = ^2^A(zi)/N A , 



and finally multiplies fdr^(zj) by pao/po- in the simulations for Figure 9, 
Pao had mean 0.575 and standard deviation 0.035, reasonably close to the 
true value pao = 0.50. 
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m tii 2.0 25 3.0 3.5 4.0 4.5 

Fig. 9. Light lines show 100 simulations 0/ fdryi(z), (7.1), /rom model (7.5); heavy line 
is actual fdrA(z) curve. [Factor pao/po in (8.28) not included in actual or simulations.] 
Also shown is combined rate fdr(z). Formula (7.1) provides good accuracy in this case. 

9. Discussion and summary. A more accurate title for this paper might 
have been "When shouldn't hypothesis testing problems be combined?" A 
general algorithm for combining or separating problems is beyond my scope 
here, but the analysis makes it clear that combination can be dangerous in 
situations like that of Figure 5. On the positive side, the simple Bayesian 
theorem of Section 2, extended at (4.5), helps signal if separation is called 
for, and even how it can be efficiently carried out. Some specific points: 

• Combining problems increases the empirical Bayes inferential accuracy, 
with N > 1000 necessary for reasonably accurate direct estimation of 
false discovery rates (5.13)-(5.14), at least in the partially nonparametric 
framework of Benjamini and Hochberg (1995). 

• However, the Separate-Class model of Figure 2, and its ensuing theorem 
(2.16), imply that separate inferences can be necessary for problems of dif- 
fering structure. The question of whether to combine problems amounts, 
here, to a question of trading off variance with bias in the estimation of 
false discovery rates. 

• Situations like that of Figure 5 argue strongly against a single combined 
analysis. The theorem can be implemented as in Figure 6 to estimate 
fdr or Fdr for small subclasses, N = 82 in Figure 6, and with surprising 
accuracy as shown in Section 7. 

• A formal test for separation can be based on the slope statistic (6.3). This 
provided strong evidence for the necessity of separation in the p53 enrich- 
ment example of Figure 7, and moderately strong evidence in Figure 6. 
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order read 

Fig. 10. Paired t-statistics comparing affected versus nonaffected tissue in 13 cancer 
patients; microarray study of 12625 genes. The t-values are plotted vertically, against 
the order in which they were read from the array. Smoothing spline (solid curve) reveals 
periodic disturbances. 



• Section 5 shows that controlling the false discovery rate in separate classes 
also controls it in combination, at least if the expected number of tail 
events isn't too small. In this sense, Fdr analysis has an advantage over 
other simultaneous testing techniques. 

Whether or not the specific methodology presented here appeals to the 
reader, the general question of which problems to combine in a simultaneous 
testing situation remains important. As a matter of due diligence, plotting 
test statistics versus possible covariates, as in Figure 5, can raise a warn- 
ing flag against casual combination. Such covariates exist even in loosely 
structured microarray studies — where, for example, the order in which the 
expression levels are read off the plate can reveal noticeable effects. 

This last point is illustrated in Figure 10, where a periodic disturbance 
in the microarray reading mechanism has evidently affected the gene-wise 
summary statistics. Subtracting the estimated disturbance function from the 
observed i-statistics is an obviously wise first step. Adjustments that make 
cases more comparable are a complementary tactic to separate analyses. 
Both can be useful in large-scale testing situations. In Figure 5, for example, 
we might adjust the z-values by subtracting the local median and dividing 
by the local spread (84%-16%). The resulting version of Figure 5, however, 
still displays obvious inhomogeneity, and requires separate analyses like that 
in Figure 7 to ferret out the interesting cases. 
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